-
-
Notifications
You must be signed in to change notification settings - Fork 414
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Reimplementation of end symbol (Issue #237) #880
base: master
Are you sure you want to change the base?
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you remember what the problem was last time you tried to implement this? And is it fixed?
for token in self.lexer_state.lex(self.parser_state): | ||
self.parser_state.feed_token(token) | ||
|
||
def feed_eof(self, last_token=None): | ||
"""Feed a '$END' Token. Borrows from 'last_token' if given.""" | ||
eof = Token.new_borrow_pos('$END', '', last_token) if last_token is not None else Token('$END', '', 0, 1, 1) | ||
"""Feed a END Token. Borrows from 'last_token' if given.""" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, overlooked this in the review: Shouldn't we feed multiple ENDs here as well?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm I think you're right.
I don't remember the problem, only that it messed some grammars up. But I changed the LALR logic since, made it a little cleaner, so maybe that was solved on the way? I'll make sure to do thorough tests before merging it in. |
I'm trying to test this right now. What's the terminal to use? |
No, It is just |
Maybe I'm going too far but is this supposed to work with custom Indenters? Because I'm facing a problem where
|
It is supposed to work. Can you give a short example script? |
|
Now it's just really slow. Or maybe it's just my grammar.... EDIT: Ok, never mind. Apparently, it was kinda hanging when I did (value (_NL|$|_DEDENT)+ ) but removing (I'm making a programming language-type thing. It's whitespace sensitive) SECOND EDIT: So I'm trying to make an error and there we go: Traceback (most recent call last):
File "/Users/bryanhu/projects/langs/tylaireum/lark/lark/parsers/lalr_parser.py", line 127, in feed_token
action, arg = states[state][token.type]
KeyError: 'IDENTIFIER'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/Users/bryanhu/.pyenv/versions/3.7.9/lib/python3.7/runpy.py", line 193, in _run_module_as_main
"__main__", mod_spec)
File "/Users/bryanhu/.pyenv/versions/3.7.9/lib/python3.7/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "/Users/bryanhu/projects/langs/tylaireum/tylaireum/__main__.py", line 79, in <module>
print(parser.grammar.parse(INLINE).pretty())
File "/Users/bryanhu/projects/langs/tylaireum/lark/lark/lark.py", line 552, in parse
return self.parser.parse(text, start=start, on_error=on_error)
File "/Users/bryanhu/projects/langs/tylaireum/lark/lark/parser_frontends.py", line 107, in parse
return self.parser.parse(stream, start, **kw)
File "/Users/bryanhu/projects/langs/tylaireum/lark/lark/parsers/lalr_parser.py", line 42, in parse
return self.parser.parse(lexer, start)
File "/Users/bryanhu/projects/langs/tylaireum/lark/lark/parsers/lalr_parser.py", line 176, in parse
return self.parse_from_state(parser_state)
File "/Users/bryanhu/projects/langs/tylaireum/lark/lark/parsers/lalr_parser.py", line 193, in parse_from_state
raise e
File "/Users/bryanhu/projects/langs/tylaireum/lark/lark/parsers/lalr_parser.py", line 184, in parse_from_state
state.feed_token(token)
File "/Users/bryanhu/projects/langs/tylaireum/lark/lark/parsers/lalr_parser.py", line 130, in feed_token
raise UnexpectedToken(token, expected, state=self, interactive_parser=None)
^Clark.exceptions.UnexpectedToken: <exception str() failed> But as you can see, I had to I am too lazy to send a minimal reproducible example right now, so if you want my grammar, etc, I'll send the whole enchilada. |
@ThatXliner Thanks for finding that. I think I fixed it, so try now. |
nope. EDIT: Though I noticed it wasn't hanging for valid input this time |
@ThatXliner Ok then, hit me with your enchilada. But make sure it's easy to see it fail. |
Probably it was hanging for valid input when I sent my last comment but I most likely didn't double-check. Now, when I make it valid, it hangs. Here's the enchilada: https://gist.github.com/ThatXliner/96829ce180b5659198a822cf74b6c309 Tell me if you can reproduce or not |
I found a minimal breaking example: from lark import Lark
p = Lark("""
?start: "foo" x "bar"
x: $ x
""", parser="lalr")
p.parse("foo") The infinite loop happens because I need to think about how to solve this. My current intuition is that this grammar shouldn't even be allowed. But I'm not sure how easy it is to detect reliably. But maybe I just need to rethink the loop mechanism, and solve it by only providing the end token once. |
I'm wondering how you're implementing the end token for right now. I think it's better to parse normally, get an EOF error, see if the grammar expects it and if so, continue parsing. Unless that's what you're doing right now. |
So, it seems that this can be "solved" by testing that the parser only shifts once on END, and otherwise throwing a run time error for infinite loop. But I don't like it. Lark shouldn't allow to compile a grammar that can never run to completion. But, I'm not sure how to detect it at compile time. |
@@ -132,7 +133,7 @@ def feed_token(self, token, is_end=False): | |||
|
|||
if action is Shift: | |||
# shift once and return | |||
assert not is_end | |||
# assert not is_end |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wonder what would happen if you undid this
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's just an assert. It doesn't matter.
Hi |
@seimit There are issues with this PR, so currently no. The problem is that in some edge cases, the parser gets stuck in a loop forever. Which is something I never want Lark to be able to do. |
@erezsh Shouldn't it be possible to add some early point reject all BNF rules that have a Terminal following an
we completely throw it away, since it can never actually match. For rules of the form
We need to check if |
It's a good point that we can detect it during grammar analysis. And then we can just throw an error if anything follows the $END symbol. I don't think there's a good enough reason to allow them, empty rules or not. |
Yes there is. |
OTOH, maybe it shouldn't be allowed. In effect it is the same as But also, can we somehow detect this:
? |
If we do it at the grammar analysis level, we can make sure |
Hmm actually, no, that's the crux of the problem, which I guess is why I left it open. On one hand, the grammar has to allow the option of something after an end token, otherwise, there's no advantage in implementing it. On the other hand, it's not obvious how to block the bad uses while allowing the good ones. |
The point is more that there is that it is possible to accept nothing after |
Throwing in my $0.02, just to say that this feature would be useful to me. |
No description provided.